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OPTIMAL RATES OF CONVERGENCE FOR COVARIANCE 
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Covariance matrix plays a central role in multivariate statistical 
analysis. Significant advances have been made recently on developing 
both theory and methodology for estimating large covariance matri- 
ces. However, a minimax theory has yet been developed. In this paper 
we establish the optimal rates of convergence for estimating the co- 
variance matrix under both the operator norm and Frobenius norm. 
It is shown that optimal procedures under the two norms are differ- 
ent and consequently matrix estimation under the operator norm is 
fundamentally different from vector estimation. The minimax upper 
bound is obtained by constructing a special class of tapering estima- 
tors and by studying their risk properties. A key step in obtaining the 
optimal rate of convergence is the derivation of the minimax lower 
bound. The technical analysis requires new ideas that are quite dif- 
ferent from those used in the more conventional function/sequence 
estimation problems. 

1. Introduction. Suppose we observe independent and identically dis- 
tributed p-variate random variables Xi, . . . , X n with covariance matrix S pxp 
and the goal is to estimate the unknown matrix S pxp based on the sam- 
ple {Xj : i = 1, . . . , n}. This covariance matrix estimation problem is of fun- 
damental importance in multivariate analysis. A wide range of statistical 
methodologies, including clustering analysis, principal component analysis, 
linear and quadratic discriminant analysis, regression analysis, require the 
estimation of the covariance matrices. With dramatic advances in technol- 
ogy, large high-dimensional data are now routinely collected in scientific 
investigations. Examples include climate studies, gene expression arrays, 



Received October 2008; revised September 2009. 
Supported in part by NSF Grant DMS-06-04954. 

Supported in part by NSF Grants DMS-05-04387 and DMS-06-04571. 
3 Supported in part by NSF Career Award DMS-06-45676. 
AMS 2000 subject classifications. Primary 62H12; secondary 62F12, 62G09. 
Key words and phrases. Covariance matrix, Frobenius norm, minimax lower bound, op- 
erator norm, optimal rate of convergence, tapering. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2010, Vol. 38, No. 4, 2118-2144. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



T. T. CAI, C.-H. ZHANG AND H. H. ZHOU 



functional magnetic resonance imaging, risk management and portfolio al- 
location and web search problems. In such settings, the standard and most 
natural estimator, the sample covariance matrix, often performs poorly. See, 
for example, Muirhead (1987), Johnstone (2001), Bickel and Levina (2008a, 
2008b) and Fan, Fan and Lv (2008). 

Regularization methods, originally developed in nonparametric function 
estimation, have recently been applied to estimate large covariance matrices. 
These include banding method in Wu and Pourahmadi (2009) and Bickel 
and Levina (2008a), tapering in Furrer and Bengtsson (2007), thresholding 
in Bickel and Levina (2008b) and El Karoui (2008), penalized estimation in 
Huang et al. (2006), Lam and Fan (2007) and Rothman et al. (2008), regu- 
larizing principal components in Johnstone and Lu (2009) and Zou, Hastie 
and Tibshirani (2006). Asymptotic properties and convergence results have 
been given in several papers. In particular, Bickel and Levina (2008a, 2008b), 
El Karoui (2008) and Lam and Fan (2007) showed consistency of their es- 
timators in operator norm and even obtained explicit rates of convergence. 
However, it is not clear whether any of these rates of convergence are opti- 
mal. 

Despite recent progress on covariance matrix estimation there has been 
remarkably little fundamental theoretical study on optimal estimation. In 
this paper, we establish the optimal rate of convergence for estimating the 
covariance matrix as well as its inverse over a wide range of classes of covari- 
ance matrices. Both the operator norm and Frobenius norm are considered. 
It is shown that optimal procedures for these two norms are different and 
consequently matrix estimation under the operator norm is fundamentally 
different from vector estimation. In addition, the results also imply that the 
banding estimator given in Bickel and Levina (2008a) is sub-optimal under 
the operator norm and the performance can be significantly improved. 

We begin by considering optimal estimation of the covariance matrix 
S over a class of matrices that has been considered in Bickel and Lev- 
ina (2008a). Both minimax lower and upper bounds are derived. We write 
fln ~ b n if there are positive constants c and C independent of n such that 
c < OL n /b n < C. For a matrix A its operator norm is defined as ||^4|| = 
sup|| a .|| 2=1 1 1 ^4a? 1 1 2 - We assume that p< exp^n) for some constant 7 > 0. 
Combining the results given in Section 3, we have the following optimal 
rate of convergence for estimating the covariance matrix under the operator 
norm. 

Theorem 1. The minimax risk of estimating the covariance matrix £ 
over the class V a given in (3) satisfies 
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The minimax upper bound is obtained by constructing a class of taper- 
ing estimators and by studying their risk properties. It is shown that the 
estimator with the optimal choice of the tapering parameter attains the 
optimal rate of convergence. In comparison to some existing methods in 
the literature, the proposed procedure does not attempt to estimate each 
row/column optimally as a vector. In fact, our procedure does not opti- 
mally trade bias and variance for each row/column. As a vector estimator, 
it has larger variance than squared bias for each row/column. In other words, 
it is undersmoothed ELS £1 vector. 

A key step in obtaining the optimal rate of convergence is the derivation of 
the minimax lower bound. The lower bound is established by using a testing 
argument, where at the core is a novel construction of a collection of least 
favorable multivariate normal distributions and the application of Assouad's 
lemma and Le Cam's method. The technical analysis requires ideas that are 
quite different from those used in the more conventional function/sequence 
estimation problems. 

In addition to the asymptotic analysis, we also carry out a small simu- 
lation study to investigate the finite sample performance of the proposed 
estimator. The tapering estimator is easy to implement. The numerical per- 
formance of the estimator is compared with that of the banding estimator 
introduced in Bickel and Levina (2008a). The simulation study shows that 
the proposed estimator has good numerical performance; it nearly uniformly 
outperforms the banding estimator. 

The paper is organized as follows. In Section 2, after basic notation and 
definitions are introduced, we propose a tapering procedure for the covari- 
ance matrix estimation. Section 3 derives the optimal rate of convergence 
for estimation under the operator norm. The upper bound is obtained by 
studying the properties of the tapering estimators and the minimax lower 
bound is obtained by a testing argument. Section 4 considers optimal esti- 
mation under the Probenius norm. The problem of estimating the inverse 
of a covariance matrix is treated in Section 5. Section 6 investigates the nu- 
merical performance of our procedure by a simulation study. The technical 
proofs of auxiliary lemmas are given in Section 7. 

2. Methodology. In this section we will introduce a tapering procedure 
for estimating the covariance matrix S pxp based on a random sample of p- 
variate observations Xi,...,X n . The properties of the tapering estimators 
under the operator norm and Probenius norm are then studied and used to 
establish the minimax upper bounds in Sections 3 and 4. 

Given a random sample {Xi, . . . ,X n } from a population with covariance 
matrix £ = S pxp , the sample covariance matrix is 

1 n 

_£ ( x I -x)(x I _xf I 

(=1 
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which is an unbiased estimate of E, and the maximum likelihood estimator 
of E is 

n 

(2) E* = (<7f.)i<i,i<p = - E( X ' " X K X ' " X ) T ' 

1=1 

when X;'s are normally distributed. These two estimators are close to each 
other for large n. We shall construct estimators of the covariance matrix E 
by tapering the maximum likelihood estimator E*. 

Following Bickel and Levina (2008a) we consider estimating the covariance 
matrix E pxp = (cy)i<j,j<p over the following parameter space: 

T a = F a (M ,M) = | E : maxj^flffijl : \i - j\ > k} < Mk~ a 

^ i 

(3) 

for all k, and A max (E) < Mq 

where A max (E) is the maximum eigenvalue of the matrix E, and a > 0, M > 
and Mq > 0. Note that the smallest eigenvalue of any covariance matrix in 
the parameter space F a is allowed to be which is more general than the 
assumption in (5) of Bickel and Levina (2008a). The parameter a in (3), 
which essentially specifies the rate of decay for the covariances (Tm as they 
move away from the diagonal, can be viewed as an analog of the smooth- 
ness parameter in nonparametric function estimation problems. The optimal 
rate of convergence for estimating E over the parameter space J- a (Mo,M) 
critically depends on the value of a. Our estimators of the covariance ma- 
trix E are constructed by tapering the maximum likelihood estimator (2) as 
follows. 

Estimation procedure. For a given even integer k with 1 < k < p, we 
define a tapering estimator as 

(4) E = Efc = (wijO-*j) pX p, 

where a?, are the entries in the maximum likelihood estimator E* and the 
weights 

( 5 ) w ij = K 1 ^ ~ - ( k h - 

where kh = k/2. Without loss of generality we assume that k is even. Note 
that the weights Wij can be rewritten as 

' 1, when \i — j\ < kh, 

I ^ — J I 

2 ; , when kh < \i — j\ < k, 

kh 

, 0, otherwise. 
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k_h k 

FlG. 1. The weights as a function of \i — j\. 

See Figure 1 for a plot of the weights Wij as a function of \i — j\. 

The tapering estimators are different from the banding estimators used 
in Bickel and Levina (2008a). It is important to note that the tapering 
estimator given in (4) can be rewritten as a sum of many small block matrices 
along the diagonal. This simple but important observation is very useful for 
our technical arguments. Define the block matrices 

M* (m) = (<t?-J{Z < i < I + m, I < j < I + m}) pxp 

and set 

v 

2=1— m 

for all integers 1 — m < I < p and m > 1. 

Lemma 1. The tapering estimator given in (4) can be written as 
(6) ± k = k^\S<V -S< kh ^). 

It is clear that the performance of the estimator depends on the choice 
of the tapering parameter k. The optimal choice of k critically depends on 
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the norm under which the estimation error is measured. We will study in the 
next two sections the rate of convergence of the tapering estimator under 
both the operator norm and Frobenius norm. Together with the minimax 
lower bounds derived in Sections 3 and 4, the results show that a tapering es- 
timator with the optimal choice of k attains the optimal rate of convergence 
under these two norms. 

3. Rate optimality under the operator norm. In this section we will es- 
tablish the optimal rate of convergence under the operator norm. For 1 < q < 
oo, the matrix £ g -norm of a matrix A is defined by ||^4||q = max|| a .|| (?=1 ||Ax|| g . 
The commonly used operator norm || • || coincides with the matrix ^-norm 
|| • H2. For a symmetric matrix A, it is known that the operator norm ||A|| is 
equal to the largest magnitude of eigenvalues of A. Hence it is also called the 
spectral norm. We will establish Theorem 1 by deriving a minimax upper 
bound using the tapering estimator and a matching minimax lower bound 
by a careful construction of a collection of multivariate normal distribu- 
tions and the application of Assouad's lemma and Le Cam's method. We 
shall focus on the case p > ?7, 1 /( 2a+1 ) in Sections 3.1 and 3.2. The case of 
p < n l ^ 2a+1 \ which will be discussed in Section 3.3, is similar and slightly 
easier. 

3.1. Minimax upper bound under the operator norm. We derive in this 
section the risk upper bound for the tapering estimators defined in (6) un- 
der the operator norm. Throughout the paper we denote by C a generic 
positive constant which may vary from place to place but always depends 
only on indices a, Mq and M of the matrix family. We shall assume that 
the distribution of the AYs is sub-Gaussian in the sense that there is p > 
such that 

(7) P{|v T (Xi-EXi)|>t}<e~' 2p/2 for alU > and ||v|| 2 = 1. 

Let Va. = 'PaiMo, M, p) denote the set of distributions of Xi that satisfy (3) 
and (7). 

Theorem 2. The tapering estimator defined in (6), of the covari- 
ance matrix T, pxp with p > n l ^ 2a+l ^> satisfies 

(8) su P E||S fc - S|| 2 < C k + } ° gP + Ck~ 2a 

I a 

for k = o(n), logp = o(n) and some constant C > 0. In particular, the esti- 
mator T, = Sfc with k = n l ^ 2a+1 ^ satisfies 

(9) supE||S - S|| 2 < Cn- 2Q /( 2a+1 ) + C^. 
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From (8) it is clear that the optimal choice of k is of order n l ^ 2a+l \ The 
upper bound given in (9) is thus rate optimal among the class of the tapering 
estimators defined in (6). The minimax lower bound derived in Section 3.2 
shows that the estimator S& with k = n 1 /^ 1 ) is in fact rate optimal among 
all estimators. 

PROOF of Theorem 2. Note that S* is translation invariant and so is 
S. We shall thus assume ji = for the rest of the paper. Write 

1 n 1 n 

= - E( x < - x x x < - X ) T = - £ x < x ^ - xxT > 

l=i l=i 

where XX is a higher order term (see Remark 1 at the end of this sec- 
tion). In what follows we shall ignore this negligible term and focus on the 
dominating term - J2?=i 2QX^\ 

Set E = i Ya=i x ' x f and write s = (°ij)i<i,j<p- Let 

(10) S = {&ij)l<i,j<p = { w ij^ij)l<i,j<p 

with Wij given in (5). Let Xj = (X[,X l 2 , . . . ,X l p ) T . We then write &ij = 

h Ef=i x i x j- lt is eas y to see 

(11) Eaij = aij, 

(12) Var(ay) = - Var(X?Xj) < ^E(X^j) 2 < i[E(^) 4 ] 1/2 [IE(Xj) 4 ] 1/2 < -, 

that is, is an unbiased estimator of with a variance 0(l/ra). 
We will first show that the variance part satisfies 

(13) E ||S-ES|| 2 <C^±]^ 

n 

and the bias part satisfies 

(14) ||ES-S|| 2 <Ck~ 2a . 
It then follows immediately that 

EIIS - SIP < 2EIIS -ESf + 2IIES - SIP < 2C[ k + logp + fc-2a 



This proves (8) and equation (9) then follows. Since p > n 1// ( 2a+1 - ) , we may 
choose 

(15) k = n V(2«+i) 

and the estimator S with A; given in (15) satisfies 



EIIS - Sll 2 < 2C[ n -2«/(2a+l) + 



logp 



n 
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Theorem 2 is then proved. □ 

We first prove the risk upper bound (14) for the bias part. It is well known 
that the operator norm of a symmetric matrix A = (aij) pxp is bounded by 
its t\ norm, that is, 

v 

II All < ||A|h = max > |a,-,-| 

[see, e.g., page 15 in Golub and Van Loan (1983)]. This result was used in 
Bickel and Levina (2008a, 2008b) to obtain rates of convergence for their 
proposed procedures under the operator norm (see discussions in Section 
3.3). We bound the operator norm of the bias part EE — S by its i\ norm. 
Since Ecr^- = Oij , we have 

Et-^ = (( Wij -l)aij) pxp , 

where Wij £ [0, 1] and is exactly 1 when \i — j\ < k, then 

<M 2 k~ 2a . 



|ES - S|| 2 < 



n 2 

max > |(Tj 
i=i,..., P ^ 1 • 
y. \i-j\>k 



Now we establish (13) which is relatively complicated. The key idea in 
the proof is to write the whole matrix as an average of matrices which 
are sum of a large number of small disjoint block matrices, and for each 
small block matrix the classical random matrix theory can be applied. The 
following lemma shows that the operator norm of the random matrix S — EE 
is controlled by the maximum of operator norms of p number of k x k random 
matrices. Let = <i < I + m,l < j <l + rn}) pxp . Define 

iV, = max \\M, —KM, \\. 
!</<p-m+l 



Lemma 2. Let £ be defined as in (6). Then 

\\t -ES|| < 3N^ m \ 

For each small mx m random matrix with m = k, we control its operator 
norm as follows. 

Lemma 3. There is a constant p\ > such that 
(16) P{iv/ m) >x}< 2p5 m exp(-nx 2 Pl ) 

for all < x < pi and 1 — m < I < p. 
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With Lemmas 2 and 3 we are now ready to show the variance bound (13). 
By Lemma 2 we have 

E||S - E£|| 2 < 9E(iv/ m) ) 2 = 9E(iV/ m) ) 2 [J(iV/ m) < x) + /(iv/ m) > x)} 

<9[x 2 + E(N^ m) ) 2 I(N^ m) >x)]. 

Note that ||ES|| < ||S||, which is bounded by a constant, and ||S|| < 
The Cauchy-Schwarz inequality then implies 

E||£-E£|| 2 <C\[x 2 + E(\\t\\ 2 F + C)I(Nj; m) > x)] 



<C\[x 2 + JE(||S|| F + C)VP(iv/ m) >x 



Set x = 4y lo ^j~ m - Then x is bounded by p\ as n — > oo. From Lemma 3 we 
obtain 



E||£-E£|| 2 < C 

(17) 



l ° gP + m +p 2 -(p5 m .p-*e-^/ 2 



n 

logp + m 
n 



— — T 

Remark 1. In the proof of Theorem 2, the term XX was ignored. It is 
not difficult to see that this term has negligible contribution after tapering. 
Let H = XX T and H = (hij) pxp . Define 

H[ m) = (hijl{l <i<l + m,l<j <l + m}) pxp . 
Similarly to Lemma 3, it can be shown that 

(18) P( max ||#, (m) -E#, (m) || > t\ <2p5 m exp(-ntp 2 ) 

ll</<p— m+l J 

for all < t < p 2 and 1 - m < I < p. Note that EH = ^S, then 

E||#|| 2 < 2E\\H - EH\\ 2 + 2||Kff || 2 < 2E||fT - EF|| 2 + 2M 2 /n 2 . 
Let t = 16 losp+m . From (18) we have 

np2 v ' 

E\\H-EH\\ 2 <t 2 + E\\H -EH\\ 2 l( max ||#, {m) - E#, (m) II > t 

\l<l<p-m+l 

a , ( .2\ ^ „/logp + m x 2 



t 2 + o{t z )<C 



n 



by similar arguments as for (17). Therefore H has a negligible contribution 
to the risk. 
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3.2. Lower bound under the operator norm. Theorem 2 in Section 3.1 
shows that the optimal tapering estimator attains the rate of convergence 
n -2a/(2a+i) _|_ l2££ # j n this section we shall show that this rate of convergence 
is indeed optimal among all estimators by showing that the upper bound in 
equation (9) cannot be improved. More specifically we shall show that the 
following minimax lower bound holds. 

Theorem 3. Suppose p < exp^n) for some constant 7 > 0. The mini- 
max risk for estimating the covariance matrix X over V a under the operator 
norm satisfies 



inf supE||S - £|| 2 > cn~ 2a/i2a+1) + c 



logp 



2 V a n 

The basic strategy underlying the proof of Theorem 3 is to carefully con- 
struct a finite collection of multivariate normal distributions and calculate 
the total variation affinity between pairs of probability measures in the col- 
lection. 

We shall now define a parameter space that is appropriate for the minimax 
lower bound argument. For given positive integers k and m with 2k <p and 
1 < m < k, define the pxp matrix B(m, k) = (bij) pxp with 

bij = I{i = m and m + 1 < j < 2k, or j = m and m + 1 < i < 2k}. 

Set k = n 1 ^ 2 " 4 ' 1 ) and a = k~^ a+l \ We then define the collection of 2 k co- 
variance matrices as 

(19) Fn = ( : E(0) = I p + T aJ2 m B{m, fc), = (0 m ) G {0, l} k \ , 

{ m=l J 

where I p is the pxp identity matrix and < r < 2~ a ~ 1 M. Without loss of 
generality we assume that Mq > 1 and p> 1. Otherwise we replace I p in (19) 
by el p for < e < min{A/o, p}- For < r < 2~ a ~ l M it is easy to check that 
F\\ C J-" Q (Mo,M) as n — > 00. In addition to T\\ we also define a collection 
of diagonal matrices 

(20) Tvi = |s m :S m = / p + ?-\ogp 1 I{i = j = m}^ ,0<m<pij, 

where p\ = mm{p,e n / 2 } and < r < min{(Af — 1) 2 > (p ~ I) 2 , !}• Let T\ = 
T\\ U It is clear that T\ C T a (M ,M). 

We shall show below separately that the minimax risks over multivariate 
normal distributions with covariance matrix in (19) and (20) satisfy 

(21) inf supE||S - S|| 2 > cra- 2a/(2a+1) 

E ^11 
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and 

(22) infsupE||S-i;|| 2 >ci^ 

for some constant c> 0. Equations (21) and (22) together imply 

(23) inf supE||S - S|| 2 > - f„-2«/(2«+i) + 

E T\ 2 V n ) 

for multivariate normal distributions and this proves Theorem 3. We shall 
establish the lower bound (21) by using Assouad's lemma in Section 3.2.1 and 
the lower bound (22) by using Le Cam's method and a two-point argument 
in Section 3.2.2. 

3.2.1. A lower bound by Assouad's lemma. The key technical tool to 
establish equation (21) is Assouad's lemma in Assouad (1983). It gives a 
lower bound for the maximum risk over the parameter set G = {0, l} k to 
the problem of estimating an arbitrary quantity ip(9), belonging to a metric 
space with metric d. Let H(9,9') = Y2i=i \®i ~ @i\ be the Hamming distance 
on {0, l} k , which counts the number of positions at which 9 and 9' differ. 
For two probability measures P and Q with density p and q with respect 
to any common dominating measure fi, write the total variation affinity 
\\P A Q\\ = J pAqdfi. Assouad's lemma provides a minimax lower bound for 
estimating ip(0). 

Lemma 4 (Assouad). Let = {0, l} k and let T be an estimator based 
on an observation from a distribution in the collection {Pg,9 £ G}. Then for 
all s > 

max2 s E e d s (T,^(9))> min d L^M^Jl A min ||F e AP^II. 
eee v H(8,e>)>i H(0,0') 2 H(e,e')=i" " 

Assouad's lemma is connected to multiple comparisons. In total there 
are k comparisons. The lower bound has three factors. The first factor is 
basically the minimum cost of making a mistake per comparison, and the 
last factor is the lower bound for the total probability of making type I 
and type II errors for each comparison, and k/2 is the expected number of 
mistakes one makes when P# and P#/ are not distinguishable from each other 
when H(9,6') = 1. 

We now prove the lower bound (21). Let Xi,...,X n - <iV(O,E(0)) with 
S(6*) G Denote the joint distribution by Pq. Applying Assouad's lemma 
to the parameter space T\\, we have 

inf max 2 2 £ , #||S — £ 



i: fe{o,i} 



k 
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(24) 

> min 11 , min \\PgAPa, . 

H(e,e')>i H(6,6') 2 h{6,9')=i 

We shall state the bounds for the the first and third factors on the right- 
hand side of (24) in two lemmas. The proofs of these lemmas are given in 
Section 7. 

Lemma 5. Let X(#) be defined as in (19). Then for some constant c > 



mm 



S W;,?»'>^. 



H(e,e')>i H{6,e> 

Lemma 6. Let Xi, . . . ,X n U ~' N(0, E(0)) with Z(0) e Tn. Denote the 
joint distribution by Pq. Then for some constant c> 

min llPfl A Pg'll > c - 
H(8,8')=l 

It then follows from Lemmas 5 and 6 together, with the fact k = n l ^ 2a+l \ 
max 2 2 E e \\t - £(fl)|| 2 > -k 2 a 2 > Cl rT 2a ^ 2a+1 ) 
for some ci > 0. 



3.2.2. ^4 lower bound using Le Cam's method. We now apply Le Cam's 
method to derive the lower bound (22) for the minimax risk. Let X be an 
observation from a distribution in the collection {Pq,0 S 0} where = 
{6q,9i, . . . ,6 pi }. Le Cam's method, which is based on a two-point test- 
ing argument, gives a lower bound for the maximum estimation risk over 
the parameter set O. More specifically, let L be the loss function. Define 
r(6 ,6_ m ) = M t [L(t,6o) + L(t,6 m )] and r min = infi< m < Pl r(0 o ,O m ), and de- 
note P = i E£=i *V 

Lemma 7. Let T be an estimator of 6 based on an observation from a 
distribution in the collection {Po,0 £ = {9q, 9\, @ Pl }}, then 

supEL(T,0)>-r min ||P0 o AP||. 

8 2 

We refer to Yu (1997) for more detailed discussions on Le Cam's method. 
To apply Le Cam's method, we need to first construct a parameter set. 
For 1 < m <pi, let S m be a diagonal covariance matrix with a mm = 1 + 

\J T l0 n Pl ' aii = ^ ^ or * 7^ m ! an d be the identity matrix. Let X; = 
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(X[,X l 2 , . . . , X l p ) T ~ iV(0, £ m ), and denote the joint density of Xi, . . . , X n 
by fnn <m< pi with pi = max{p, e n / 2 }, which can be written as follows: 

fm= ]J MX))' II ^mOOi 

l<i<n,l<j<p,j^m l<i<n 

where 4> a , (7 = 1 or (J mm , is the density of N(0,a 2 ). Denote by fo the joint 
density of Xi, . . . , X n when X; ~ N(0, So). 

Let 9 m — S m for < m < p\ and the loss function L be the squared 
operator norm. It is easy to see r(6o,6 m ) = ^r lo ^ Pl for all 1 < m <p\. Then 
the lower bound (22) follows immediately from Lemma 7 if there is a constant 
c> such that 

(25) ||P eo AP||>c. 

Note that for any two densities go and q±, f go A gi dfi = 1 — ^ f |go — gi \ dfj,, 
and Jensen's inequality implies 



/ 



go - qi | dfi 



qo-qi 



qi 



gi d/i < 



(go ~ qif 

qi 



qi 



Hence J go A gi d[i > 1 — \ {J ^ d/i — l) 1 / 2 . To establish equation (25), it thus 
suffices to show that f (-j- Ylm=i fm) 2 / fo d\x — 1 — > 0, that is, 



pi 



(26) 



Pi 



m=l 



/o 



Pi 



m^j 



fmfj 



d/i - 1 0. 



We now calculate J ^"^ J d/i. For m / j it is easy to see 



fo 



When m = j, we have 



/ 2 (v/2^ mm 



-2n 



fmfj 

n 

KKn 



d/i - 1 = 0. 



cxp 



(•^m) 



1 1 

hr 



[l-(l-cr mm ) 2 ]-™/ 2 



1 — T 



log Pi 



-n/2 



dx* 



Thus 



pi i 

m=l 



4e 

P l m=l 



1 fodfi-1 
Jo 



14 



T. T. CAI, C.-H. ZHANG AND H. H. ZHOU 



(27) 



<— I 1 

Pi 



■ exp 



logpi 



n 



-n/2 



1 

Pi 



n 



log Pi - - log 



1 



logpi 



n 



1 

Pi 



>0 



for < t < 1, where the last step follows from the inequality log(l — x) > — 2x 
for < x < 1/2. Equation (27), together with Lemma 7, now immediately 
implies the lower bound given in (22). 



Remark 2. In covariance matrix estimation literature, it is commonly 
assumed that — > 0. See, for example, Bickel and Levina (2008a). The 
lower bound given in this section implies that this assumption is necessary 
for estimating the covariance matrix consistently under the operator norm. 



3.3. Discussion. Theorems 2 and 3 together show that the minimax risk 
for estimating the covariance matrices over the distribution space V a satis- 
fies, for p> n 1 /^ 1 ), 

(28) inf supEllE - £|| 2 x n^^^ + 

The results also show that the tapering estimator with tapering parame- 
ter k = n l ^ 2a+l ^ attains the optimal rate of convergence ? 7,- 2a /( 2a + 1 ) _|_ l^p. 

A few interesting points can be made on the optimal rate of convergence 
n -2«/(2o+i) _|_ l2££ # When the dimension p is relatively small, that is, logp = 
( n 1 /( 2a + 1 )), p has no effect on the convergence rate and the rate is purely 
driven by the "smoothness" parameter a. However, when p is large, that is, 
logp 3> n 1 /( 2a+1 ), p plays a significant role in determining the minimax rate. 

We should emphasize that the optimal choice of the tapering param- 
eter k x 

n l/(2 Q +l) ig 

different from the optimal choice for estimating the 
rows/columns as vectors under mean squared error loss. Straightforward 
calculation shows that in the latter case the best cutoff is k x n. 1 /( 2 ( a + 1 )) 
so that the tradeoff between the squared bias and the variance is optimal. 
With k x fi 1 /( 2a + 1 ) 5 the tapering estimator has smaller squared bias than 
the variance as a vector estimator of each row/column. 

It is also interesting to compare our results with those given in Bickel and 
Levina (2008a). A banding estimator with bandwidth k = (l2£P)V( 2 (a+i)) 
was proposed and the rate of convergence (liS£) a /( a + 1 ) was proved. It is 
easy to see that the banding estimator given in Bickel and Levina (2008a) 
is not rate optimal. Take, for example, a = 1/2 and p = e^™. Their rate is 
re" 1 / 6 , while the optimal rate in Theorem 1 is n" 1 / 2 . 
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It is instructive to take a closer look at the motivation behind the con- 
struction of the banding estimator in Bickel and Levina (2008a). Let the 
banding estimator be 

(29) ± B = (a* J I{\i-j\<k}) 

and denote S_b — E£# by V, and let V = (vij). An important step in the 
proof of Theorem 1 in Bickel and Levina (2008a) is to control the operator 
norm by the t\ norm as follows: 

E||SB-ESB|| 2 <E||S B -ES B ||? = Ef max 

\j=i,...,p^ J 

s n( k A V n k2lo %P 



Note that E[|%-|J{|i - j\ < k}} x 1/y/n, then E^ x k/^/n. It is then 
expected that E(maxj = i v ... p |%'|) 2 < C(-^^logp) 2 [see Bickel and Lev- 
ina (2008a) for details] and so 

E \\£-ml<c^p + ck- 2a . 

n 

An optimal tradeoff of k is then (l2££) 1 /( 2 ( Q + 1 )) which implies a rate of 
^k^p>-j- a /( Q +i) - n Theorem 1 in Bickel and Levina (2008a). This rate is slower 
than the optimal rate n - 2a /{ 2a + 1 ) _|_ l^p m Theorem 1. 

We have considered the parameter space J- a defined in (3). Other similar 
parameter spaces can also be considered. For example, in time series analysis 
it is often assumed the covariance \aij\ decays at the rate \i — j\~^ a+1 ^ for 
some a > 0. Consider the collection of positive-definite symmetric matrices 
satisfying the following conditions: 

g a = G a (.M ,M 1 ) 

(30) 

= {£ : \aij\ <M x \i- for % + j and A max (£) < M }, 

where A max (S) is the maximum eigenvalues of the matrix S. Note that 
G a (M , Mi) is a subset of F a (M Q ,M) as long as Mi < aM. Using virtually 
identical arguments one can show that 

inf su P E||S - £|| 2 x n - 2Q /( 2Q+1 ) + 

Let V' a = V a (Mo, M, p) denote the set of distributions of Xi that satisfies 
(7) and (30). 
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Remark 3. Both the tapering estimator proposed in this paper and 
banding estimator given in Bickel and Levina (2008a) are not necessarily 
positive-semidefinite. A practical proposal to avoid this would be to project 
the estimator E to the space of positive-semidefinite matrices under the 
operator norm. More specifically, one may first diagonalize E and then re- 
place negative eigenvalues by 0. The resulting estimator is then positive- 
semidefinite. 



3.3.1. The case of p <n 1 /( 2a+1 ). We have focused on the case p > n l '^' a+l ' 
in Sections 3.1 and 3.2. The case of p < n 1 ' ( 2q+1 ) can be handled in a similar 
way. The main difference is that in this case we no longer have a tapering 
estimator E^ with k = n 1// ( 2a+1 ) because k> p. Instead the maximum like- 
lihood estimator E* can be used directly. It is easy to show in this case 

(31) supE||E*-E|| 2 <C-. 

The lower bound can also be obtained by the application of Assouad's lemma 
and by using a parameter space that is similar to J-\\. To be more specific, 
for an integer 1 < m < p/2, define the p x p matrix B m = (bij) pxp with 

bij = I{i = m and m + 1 < j < p, or j = m and m + 1 < % < p}. 

Define the collection of 2 P//2 covariance matrices as 

(32) T* = (e(0) : E(0) = I P + r^— V e m B(m, k),9 = (9 m ) G {0, l} p / 2 1. 

Since p < n l ^ 2a+l \ then — ^ < 2 a+1 / 2 p~( a+1 \ Again it is easy to check 

F* C F a {M 0l M) when < r < 2' a - l M. The following lower bound then 
follows from the same argument as in Section 3.2.1: 

(33) infsupEllE- Ell 2 >cp(-4=l •-•ci>c 2 -. 

Equations (31) and (33) together yield the minimax rate of convergence for 
the case p < n x ^ 2a+x \ 

(34) infsupE||E-E|| 2 x-. 

£ v a n 

This, together with equation (28), gives the optimal rate of convergence: 

(35) inf supEllE - E|| 2 x min(n^ 2Q /( 2Q+1 ) + 

t v a I n n) 
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4. Rate optimality under the Frobenius norm. In addition to the opera- 
tor norm, the Frobenius norm is another commonly used matrix norm. The 
Frobenius norm is used in defining the numerical rank of a matrix which 
is useful in many applications, such as the principle component analysis. 
See, for example, Rudelson and Vershynin (2007). The Frobenius norm has 
also been used in the literature for measuring the accuracy of a covariance 
matrix estimator. See, for example, Lam and Fan (2007) and Ravikumar 
et al. (2008). In this section we consider the optimal rate of convergence 
for covariance matrix estimation under the Frobenius norm. The Frobenius 
norm of a matrix A = (a^ ) is defined as the £2 vector norm of all entries in 
the matrix 



This is equivalent to treating the matrix A as a vector of length p 2 . It is 
easy to see that the operator norm is bounded by the Frobenius norm, that 
is, \\A\\ < \\A\\ F . 

The following theorem gives the minimax rate of convergence for estimat- 
ing the covariance matrix X under the Frobenius norm based on the sample 
{Xi, . . . ,X n }. 

Theorem 4. The minimax risk under the Frobenius norm satisfies 



We shall establish below separately the minimax upper bound and mini- 
max lower bound. 

4.1. Upper bound under the Frobenius norm. We will only prove the up- 
per bound for the distribution set V' a given in (30). The proof for the pa- 
rameter space V a is slightly more involved by thresholding procedures as in 
Wavelet estimation. The minimax upper bound is derived by again consider- 
ing the tapering estimator (4). Under the Frobenius norm the risk function 
is separable. The risk of the tapering estimator can be bounded separately 
under the squared I2 loss for each row/column. This method has been com- 
monly used in nonparametric function estimation using orthogonal basis 
expansions. Since 




1 n2 1 

inf supE— ||E — x inf supE— ||X — X 

£ v a P E v a P 



(36) 




E<7jj = and Var (cjjj) < — 
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for the tapering estimator (4), we have 

K(wij<Tij - (Jij) 2 < (1 - Wijfafj + — 
It can be seen easily that 



f-t \2 2 i 2 C 



n 



Ie||e-e|||<- V 4 + - V 

y {(i,j)-k h <\i-j\} F {{i,j):\i-j\<k} 

= i?l + i?2 • 

The assumption A max (S) < Mq implies that an < Mq for all i. Since is 
also uniformly bounded for all i ^ j from assumption (30), we immediately 
have Ri<C^. 

It is easy to show that 

(37) \ £ *l<Ck->"-\ 

P {{i,j):k<\i-j\} 
where \<Jij\ < C\\i — j\~^ a+1 ^ for all i ^ j. Thus 

(38) E-IIS - S||| < CAT 2 "" 1 + C-< c 2 n^ 2a+1 VW a+1 » 

p n 

by choosing 

(39) jfc^nV^a+D) 

if n l ^ 2 ( a+l ^ < p, which is different from the choice of k for the operator 
norm in (15). If n 1 ^ 2 ^ a+1 ^ > p, we will choose k=p, then the bias part is 
and consequently 

E-||E-£||!<C-. 
p n 



Remark 4. For the parameter space V' a , under the Frobenius norm the 
optimal tapering parameter k is of the order n l ^ 2 ( a+l ^ . The rate of con- 
vergence of the tapering estimator with k x ?i 1 /( 2 ( Q + 1 )) under the operator 
norm is 

l«iP +n -«/(a+l) 
n 

which is slower than n _2a /( 2a + 1 ) _|_ l2££ m Similarly, the optimal pro- 
cedure under the operator norm is not rate optimal under the Frobenius 
norm. Therefore, the optimal choice of the tapering parameter k critically 
depends on the norm under which the estimation accuracy is measured. 
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Remark 5. Similarly for V' a , it can be shown that under the Probenius 
norm the banding estimator with k x n 1 ^ 2 ^" 1 " 1 ^ is rate optimal. Under the 
operator norm, Bickel and Levina (2008a) chose fcx (Is?££)i/( 2 ( a( + 1 )) f or the 
banding estimator which is close to 

n l/(2(a+l)) up 

to a logarithmic factor 

of p. On the other hand, it can be shown that for the parameter space 
V a no linear estimator can achieve the optimal convergence rate under the 
Frobenius norm. 



4.2. Lower bound under the Frobenius norm. It is sufficient to establish 
the lower bound for the parameter space V' a given in (30). Again the argu- 
ment for V a is similar. As in the case of estimation under the operator norm, 
we need to construct a finite collection of multivariate normal distributions 
with a parameter space Q2 C Q a such that 



infsupE— ||E — Slip > c— 
t g 2 V n 



for some c > when k = mm{n 1 ^ 2 ^ a+1 ^ ,p/2}. 

We construct Q2 as follows. Let < r < M be a constant. Define 



■ATU 



-1/2 



I{l<\i-j\<k}) 



for Oij =6ji = or 1}. 



'13 — vji 

It is easy to verify that Q 2 C Q a as n -> 00. Note that 9eG = {0, l}k P -k(k+l)/2_ 
Applying Assouad's lemma with d the Frobenius norm and s = 2 to the 
parameter space G27 we have 



2 l,i - 
max2 Eg— E - 

6»eg 2 p 

> min 

H{6,e')>i 



-n 
i/p\\no) 



\ 2 F kp-k(k + l)/2 



Note that 



mm — - 

H(e,e')>i p 



H(6,6' 
1||£(0)-£(0')|| 



min \\Pq A Pgi I 
_H"(0,6»')=1 



H(6,6>) 



1 [rn" 
mm 

H(e,e')>i p 



-1/212 



/ 12 

ijl 



H(0,0') 



It is easy to see that 



T 



-n 



ifep- A;(A; + l)/2 
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Lemma 8. Let P e be the joint distribution o/Xi, . . . ,X n iV(0, £(#)) 
with £(#) € Q2- Then for some constant c\ > we have 

min \\Pq A Pqi II > c\. 
H{e,e')=i 

We omit the proof of this lemma. It is very similar to and simpler than 
the proof of Lemma 6. 

From Lemma 8 we have for some c > 

(40) min \\P e A P 6 A\ > c 

H{8,e')=i 



>2 R D l Wt - T,(ft\\\1 > r.minl „-(2a+l)/(2(a+l)) P 



thus 

max2 2 £?fl-||S - £(0)||i > cmin<i n 
e&g 2 p { n 

which implies that the rate obtained in (38) is optimal. 

5. Estimation of the inverse covariance matrix. The inverse of the co- 
variance matrix S _1 is of significant interest in many statistical applications. 
The results and analysis given in Section 3 can be used to derive the optimal 
rate of convergence for estimating S _1 under the operator norm. 

For estimating the inverse covariance matrix S _1 we require the minimum 
eigenvalue of £ to be bounded away from zero. For 5 > 0, we define 

(41) L 5 = {S:A min (S)>5}. 

Let $ a = $ a (Mo,M,p,5) denote the set of distributions of Xi that satisfy 
(3), (7) and (41), and similarly, distributions in ffi a = ffi a (Mo, M, p,5) satisfy 
(7), (30) and (41). 

The following theorem gives the minimax rate of convergence for estimat- 
ing S" 1 . 

Theorem 5. The minimax risk of estimating the inverse covariance 
matrix S _1 satisfies 

(42) inf supEHiT 1 - S- 1 !! 2 X min(n- 2Q /( 2Q+1 ) + ^ £\ 

± p I n n) 

where *P denotes either or ffi a . 

Proof. We shall focus on the case p > n l ^ 2a+l \ The proof for the case 
of p < ?i 1 /( 2a+1 ) is similar. To establish the upper bound, note that 

XT 1 - S" 1 = S -1 (S - t)E~ x , 
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then 

_ x;- 1 !! 2 = ||e _1 (e - sjxr 1 !! 2 < ||e _1 || 2 ||e - s|| 2 ||s _1 || 2 . 

It follows from assumption (3) that ||E _1 || 2 < 5~ 2 . Note that P{||X-EE|| 2 > 
e} < Ap5 m exp(— ne 2 pi) for any e > which decays faster than any poly- 
nomial of n as shown in the proof of Lemmas 2 and 3. Let A m ; n (E) and 
A m i n (ES) be the smallest eigenvalues of E and EE, respectively. Then 
P(W£) < A min (ES)-e 1 /2) > P (|A min (S)-A ?lin (ES)| > e 1 ' 2 ) decays faster 
than any polynomial of n. Let < e < [A m i n (ES)/2] 2 and c = l/[A m i n (ES) — 
e 1 / 2 ], then P(||E _1 || > c) decays faster than any polynomial of n. Therefore, 



Ellxr 1 - it 1 !! 2 < f - ) Ells - e" 2 



2 

+ E[||E~ 1 || 2 ||E - EfHE^fJdlE- 1 !! > c)) 

<CmJn- 2 ^ 2 ^ + l ^,l). 
[ n n J 

The proof of the lower bound is almost identical to that of Theorem 1 
except that here we need to show 

min ^ V ' — V ; " >cka 2 

H(e,e>)>i H(6,0') 

instead of Lemma 5. For a positive definite matrix A, let A m i n (A) denote 
the minimum eigenvalue of A. Since 

£-1(0) _ ^\e') = srVx^fl) - s(e'))s _1 (fl), 

we have 

|| S -i(0) _ E _1 (6 ,/ )|| > A min (S- 1 (e))A min (S- 1 (0 / ))l|SW - E(0')ll- 
Note that 

A^nCE- 1 ^)) > 1/M , A min (E~ 1 (6/ / )) > 1/M , 

then Lemma 5 implies 

llE-^-E- 1 ^')!! 2 ||E(0)-E(0')|| 2 , 2 

H(e,e')>i H{9,6') u H(0,e')>i H(9,9') 

for some constant c > 0. □ 
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6. Simulation study. We now turn to the numerical performance of the 
proposed tapering estimator and compare it with that of the banding estima- 
tor of Bickel and Levina (2008a). In the numerical study, we shall consider 
estimating a covariance matrix in the parameter space T a defined in (3). 
Specifically, we consider the covariance matrix £ = {&ij)±<i j<p of the form 



Note that this is a Toeplitz matrix. But we do not assume that the structure 
is known and do not use the information in any estimation procedure. 

The banding estimator in (29) depends on the choice of k. An optimal 
tradeoff of k is k x (n/logp) 1 ^ 20 " 1 " 2 ^ as discussed in Section 3.3. See Bickel 
and Levina (2008a). The tapering estimator (6) also depends on k for which 
the optimal tradeoff is k x n 1 ^ 2a+1 \ In our simulation study, we choose 
k = [(n/logp) 1 ^ 20 ^ J for the banding estimator and k = [n 1 ^ 2 "^ \ for 
the tapering estimator. 

A range of parameter values for a, n and p are considered. Specifically, a 
ranges from 0.1 to 0.5, the sample size n ranges from 250 to 3000 and the 
dimension p goes from 250 to 3000. We choose the value of p to be p = 0.6 
so that all matrices are nonnegative definite and their smallest eigenvalues 
are close to 0. Table 1 reports the average errors under the spectral norm 
over 100 replications for the two procedures. The cases where the tapering 
estimator underperforms the banding estimator are highlighted in boldface. 
Figure 2 plots the ratios of the average errors of the banding estimator to the 
corresponding average errors of the tapering estimator for a = 0.1,0.2,0.3 
and 0.5. The case of a = 0.4 is similar to the case of a = 0.3. 

It can be seen from Table 1 and Figure 2 that the tapering estimator 
outperforms the banding estimator in 121 out of 125 cases. For the given 
dimension p, the ratio of the average error of the banding estimator to the 
corresponding average error of the tapering estimator tends to increase as 
the sample size n increases. The tapering estimator fails to outperform the 
banding estimator only when a = 0.5 and n = 250 in which case the values 
of k are small for both estimators. 

Remark 6. We have also carried out additional simulations for larger 
values of a with the same sample sizes and dimensions. The performance of 
the tapering and abnding estimators are similar. This is mainly dur to the 
fact that the values of k for both estimators are very small for large a when 
n and p are only moderately large. 

7. Proofs of auxiliary lemmas. In this section we give proofs of auxiliary 
lemmas stated and used in Sections 3-5. 



(43) 




1 < i = j < P, 
l<iy^j<P- 
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Table 1 

The average errors under the spectral norm of the banding estimator (BL) and the 
tapering estimator ( CZZ) over 100 replications. The cases where the tapering estimator 
underperforms the banding estimator are highlighted in italic 



p 


n 




a — 


0.1 




a = 


0.2 




a = 


0.3 




a = 


0.4 




a = 


0.5 


BL 


CZZ 


BL 


CZZ 


BL 


CZZ 


BL 


CZZ 


BL 


CZZ 


250 


250 


2 


.781 


2.706 


2 


.291 


2.023 


1 


.762 


1.684 


1 


,618 


1.517 


1. 


325 


1.507 




500 


2 


.409 


2.302 


1 


.898 


1.575 


1 


.562 


1.204 


1 


,361 


1.185 


1 


.080 


0.822 




1000 


2 


.029 


1.685 


1 


.631 


1.361 


1 


.289 


1.018 


1 


.056 


0.795 





.911 


0.859 




2000 


1 


.706 


1.153 


1 


.369 


1.122 


1 


.106 


0.908 





.878 


0.655 





.715 


0.542 




oUUU 


1 


.522 




1 


.242 


U.oyD 





.983 


U. /yo 





.810 


U.ooo 





.645 


U.4oz 


500 


250 


3 


.277 


2.914 


2 


.609 


2.097 


1 


.961 


1.788 


1 


,745 


1.610 


1. 


392 


1.571 




500 


2 


.901 


2.598 


2 


.199 


1.683 


1 


.751 


1.256 


1 


,475 


1.234 


1 


.152 


0.865 




1000 


2 


.539 


2.197 


1 


.942 


1.472 


1 


.481 


1.064 


1 


.178 


0.843 





.984 


0.917 




zuuu 


2 


.263 


1 79^ 
1. ( ZO 


1 


.669 


1 ^9fi 
l.oZD 


1 


.293 


u.yoo 


1 


.067 


u. / uu 





.866 


u.ooy 




3000 


2 


.066 


1.379 


1 


.538 


1.154 


1 


.220 


0.874 





.919 


0.696 





.781 


0.503 


1000 


250 


3 


.747 


3.086 


2 


.873 


2.223 


2 


.385 


1.842 


1 


.833 


1.694 


1. 


449 


1.643 




500 


3 


.370 


2.735 


2 


.635 


1.768 


1 


.906 


1.334 


1 


.565 


1.297 


1 


.203 


0.925 




1000 


3 


.097 


2.437 


2 


.315 


1.536 


1 


.741 


1.121 


1 


.382 


0.883 


1 


.037 


0.936 




2000 


2 


.730 


2.177 


2 


.011 


1.392 


1 


.523 


1.006 


1 


.156 


0.722 





.920 


0.591 




3000 


2 


.589 


1.968 


1 


.865 


1.264 


1 


.374 


0.911 


1 


.072 


0.723 





.834 


0.523 


2000 


250 


4 


.438 


3.177 


3 


.107 


2.300 


2 


.511 


1.956 


1 


.903 


1.744 


1. 


484 


1.736 




500 


3 


.969 


2.800 


2 


.868 


1.841 


2 


.030 


1.383 


1 


.638 


1.356 


1 


.239 


0.940 




1000 


3 


.538 


2.531 


2 


.551 


1.599 


1 


.866 


1.158 


1 


.452 


0.912 


1 


.074 


0.973 




2000 


3 


.242 


2.353 


2 


.248 


1.434 


1 


.649 


1.031 


1 


.224 


0.751 





.955 


0.611 




3000 


3 


.025 


2.219 


2 


.101 


1.302 


1 


.566 


0.929 


1 


.141 


0.743 





.868 


0.541 


3000 


250 


4 


.679 


3.219 


3 


.230 


2.358 


2 


.576 


1.995 


1 


.931 


1.797 


1. 


494 


1.776 




500 


4 


.214 


2.887 


2 


.991 


1.890 


2 


.282 


1.419 


1 


.664 


1.384 


1 


.463 


0.971 




1000 


3 


.901 


2.575 


2 


.674 


1.633 


1 


.933 


1.186 


1 


.482 


0.929 


1 


.224 


0.990 




2000 


3 


.488 


2.395 


2 


.452 


1.451 


1 


.717 


1.049 


1 


.254 


0.768 





.965 


0.619 




3000 


3 


.336 


2.278 


2 


.288 


1.321 


1 


.632 


0.948 


1 


.172 


0.750 





.880 


0.549 



Proof of Lemma 1. Without loss of generality we assume that i<j. 
The set {i,j} is contained in the set {I, . . . , l+kh — 1} if and only if I < i < j < 
l + kh — ^-i that is, j — kh + 1 < I < i ■ Note that Card{Z : j — kh + 1 < I < i} = 
(* - U - k h + 1) + 1)+ = (k h ~ \i ~ then Card{/ : {i,j} C {/,...,/ + k h - 
1}} = (kh — \i — Similarly, we have Card{/ : {i, j} C {I, . . . , I + k — 1}} = 
(k — \i — j\)+- Thus we have 

kwij = (k- \i-j\)+ - (k h - 

= Card{/ :{i,j}c{l,...,l + k- 1}} 



- Card{/ : {i,j} C {/,..., I + k h - 1}}. 



□ 
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alpha=0.1 alpha=0.2 




p=250 p=500 p=1000 p=2000 p=3000 p=250 p=500 p=1000 p=2000 p=3000 



alpha= 0.3 alpha= 0.5 




p=250 p=500 p=1000 p=2000 p=3000 p=250 p=500 p=1000 p=2000 p=3000 

Fig. 2. The vertical bars represent the ratios of the average error of the banding estimator 
to the corresponding average error of the tapering estimator. The higher the bar the better 
the relative performance of the tapering estimator. For each value of p the bars are ordered 
from left to right by the sample sizes (n = 250 to 3000) . 
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Proof of Lemma 2. Without loss of generality we assume that p is 



divisible by m. Recall that M t 

Am) 



(dijl{l < i < I + m,l < j < I + m}) pxp . 
Note that MY"'' is empty when I < 1 — m, and has at least one nonzero 



entry when l>2 — m. Set <5, 
It follows from (6) that 



(m) 



MP-EMP and S^=J2j 



P 2-mM t 



(m) 



(44) 



/=! —l<j<p/m 



Since <5^_|_j are disjoint diagonal blocks over —1 <j <p/m, we have 



l|S (r 



(45) 



E5 (m) || < m max 

K/<m 



< m max 



-l<j'<p/m 

iuMii 



1— m<l<p 



Since 5 



(*h) 



and £, are all sub-blocks of certain matrix <5, w with 1 < / < 



J-^ OjIIVJ. L/£ CXiL C CI 11 O LI IV U1UV.1VO Ul l/Cl UCllll 111(1' I I IjV 

p — A; + 1, Lemma 2 now follows immediately from equations (45) and (6). 
□ 



Proof of Lemma 3. For any m x m symmetric matrix A, we have 

\u T Au\ - \v T Av\ < \u T Au - v T Av\ = \(u - v) T A(u + v)\ 

< \\u — v\\ \\A\\ \\u + v\\. 

Let S™^ 1 be a 1/2 net of the unit sphere S" 1 " 1 in the Euclidean distance 
in R m . We have 



\A\\< sup \u T Au\ < sup \u T Au\ + -||^4||- 



u£S r ' 



i/-> 



sup \u Au\ + — \\A\\, 

m-l 4 



«e5 1/2 

which implies ||^4|| < 4sup, cq m-i\u T Au\. Since we are allowed to pack 
Card(5™~ 1 ) balls of radius 1/4 into a 1 + 1/4 ball in R m , volume com- 



parison yields 



(l/4) m Card(5 1 m / - 1 )< (5/4) r 



that is, Card(S'™ 2 ) < 5 m . Thus there exist vi, V2, . . . , v^™ £ S rn such that 



\A\\ < 4 sup \vjAvj 

j<5 m 



for all m x m symmetric A. 
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This one-step approximation argument is similar to the proof of Proposition 
4.2(H) in Zhang and Huang (2008). 

Let Xi, . . . ,X n be i.i.d. p- vectors with E(Xi— /x)(Xi— p) T = S. Under the 
sub-Gaussian assumption in (7) there exists p > such that 

P{v T (Xi - EXi)(Xj - EXi) T v >x}< e~ xp/2 for all x > and ||v|| = 1, 

which implies E(tv T (Xi -EX;)(Xi -EX;) T v) < oo for all t < p/2 and ||v|| = 
1, then there exists p\ > such that 



■j ^v T [(X-EX 4 )(X 4 -EX) T -S]v >x| 



< e -nx 2 Pl /2 



for all < x < pi and ||v|| = 1. [See, e.g., Chapter 2 in Saulis and Stat- 
ulevicius (1991).] Thus we have 



max ||M, (m) -EM, (m) || >x 

l<l<p-m+l 1 1 



l<l<p— 771+1 



(m) 



EM 



(m) I 



> x} 



< 2p5 m supP{|vJ(M^ - EMP)v y | > x} 



< 2p5 m exp(-nx 2 pi/2). □ 

Proof of Lemma 5. Set v = (l{k h < i < k}) and let 

(wt) = [£(0) - E(0')]«. 

Note that there are exactly H (9, 6') number of W{ such that \wj\ = rkha, and 
|| u ||| = kh- This implies 

l| E(() ) _ E(0| . > IPW-yiHI > • 



A:/, 



H(e,e')-T 2 k h a 2 . 



□ 



Proof of Lemma 6. When H(9,8') = 1, we will show 
||P^-P e ||?<2^(P^|^) 



= 2n 

2 

< n • c/ca 2 



for some small c > 0, where is the Kullback-Leibler divergence and 

the first inequality follows from the well-known Pinsker's inequality [see, 
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e.g., Csiszar (1967)]. This immediately implies the L\ distance between two 
measures is bounded away from 1, and then the lemma follows. Write 

Z(6')=D 1 + X(6). 

Then 

itr(S(0OS- 1 W)-| = itrp 1 E- 1 W). 

Let Aj be the eigenvalues of DiYj~ 1 (Q). Since Di'E~ 1 (9) is similar to the 
symmetric matrix Y>~ 1 / 2 (9)DiY,~ 1 / 2 (6), and 

\\T 1 - 1 i 2 {e)D l T,- 1 ' 2 {e)\\ < ns- 1 / 2 ^)!!!!!?!!!!^- 1 / 2 ^)!! 

< ci||Z>i|| < ci||I?i||i < C2&a, 

then all eigenvalues Aj's are real and in the interval [— C2ka, C2ka], where 
ka = k ■ k~( a+1 ^ = k~ a — > 0. Note that the Taylor expansion yields 

logdet(S(0')S~ 1 (0)) = logdet(/ + DiS- 1 ^)) = tr(DiE _1 (0)) - R 3 , 

where 

p 

R3 < C3 f° r some C3 > 0. 

Write S~ 1 / 2 (6l) = UV 1/2 U T , where UU T = 1 and V is a diagonal matrix. It 
follows from the fact that the Frobenius norm of a matrix remains the same 
after an orthogonal transformation that 

^A 2 = IIE- 1 / 2 ^)!^- 1 / 2 ^)!! 2 < ||y|| 2 • W^D^Wl 



i=l 



I^T 1 ^)!! 2 • \\Di\\ 2 F < c 4 ka 2 . □ 
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